Here we are concerned with correcting for allometric relationships in morphometric analyses. Many traits are correlated with body size, for example, and studying variation in such metrics must account for the fact that part of the differences we see are due to differences in some underlying quantity (e.g.Ā body size) through allometry. For example, do male lizards have wider heads than females just because they are bigger? Or is there something more that body size does not account for?
Many ways have been proposed to correct for body size, such as dividing by body size, taking the residuals of a linear regression against body size, or principal component analysis. All of these methods suffer from problems, as outlined by Chan and Grismer (2022), who recently proposed to resurrect a āproperā allometric correction already shows as superior in the 80s, and which is based on a logarithmic regression.
However, this method, together with all the previous ones, suffers from some important caveats that become apparent when studying (and comparing) multiple groups (1) that differ in their allometric relationship with the horizontal axis and/or (2) that do not occupy the same range along the horizontal axis. Here we argue that these two problems cause all of the proposed corrections to be misleading in the context of group comparison (e.g.Ā studies of sexual dimorphism where the groups are males and females), pretty much for the same reasons.
We will cover both in order with examples. In what follows, we will use male-female differences as our main working example.
# But first we setup R
rm(list = ls())
library(tidyverse)
theme_set(theme_classic())
# Load useful custom functions
source("functions.R")
set.seed(42) # for reproducibility
Let us start with an example where there is no problem. The
allometric relationship is the same between the sexes, and both males
and females cover the entire range of body sizes, or variable
x. We simulate data for males and females using some
accessory function we have created (whose source code the reader can
check in functions.R but that is not the point here).
# Let us first simulate some data
data <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.5, slope_m = 0.5, sd_y = 1)
# Check them out
data
## # A tibble: 2,000 Ć 3
## sex x y
## <chr> <dbl> <dbl>
## 1 female 31.9 18.4
## 2 male 31.9 26.5
## 3 female 22.2 12.2
## 4 male 22.2 21.5
## 5 female 26.8 12.5
## 6 male 26.8 22.8
## 7 female 28.2 14.3
## 8 male 28.2 21.2
## 9 female 27.0 12.8
## 10 male 27.0 24.3
## # ā¹ 1,990 more rows
If we plot thatā¦
# Plot
plot_data(data)
As you can see we have a variable x, which will serve as
our horizontal axis, and variable y that is related to
x and differs between the sexes. (The straight lines are
linear regressions for each sex.) This case is not a problem in the
sense that if we draw a single straight line through the whole dataset
and compute the deviation of male and female y to that
line, we essentially remove the effect of x. If we perform
a global regression through the entire dataā¦
# Here is the global straight line
plot_data(data, midline = TRUE)
And then look at the deviations between y and that
straight lineā¦
# Plot the corrected values
plot_data(data, midline = TRUE, correct = TRUE)
We see that the relationship with x has been
successfully removed but there still are male-female differences. That
is our sexual dimorphism that is not accounted for by body size! Hence,
in that case using residuals of a global regression analysis is fine to
remove the effect of body size differences between the sexes.
Things get more complicated when the relationship (irrespective of
intercept) is not the same between the groups. This could be e.g.Ā a
non-linear, allometric relationship, where y increases
faster with body size in males than in females. Let us simulate some
non-linear data.
# Simulate non-linear relationship this time
data_nl <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.5, slope_m = 0.5, sd_y = 1, fun = function(x) exp(x / 7))
# Note: the `fun` argument allows us to transform otherwise linear data.
Here, we transformed the previously linear data into some exponential-looking data.
# Plot the allometric data
plot_data(data_nl, midline = TRUE, linear = FALSE)
# Note: setting `linear` to FALSE makes sure we fit an allometric model and not a linear one
The lines are now fitted using an allometric model, which has been regarded as the best correction for allometry even for linear relationships (see Chan and Grismer, 2022). Indeed, those lines go pretty well through both sexes, but fitting a global regression and taking deviations from it gives us the following:
# Plot the corrected allometric data
plot_data(data_nl, midline = TRUE, correct = TRUE, linear = FALSE)
Which is less than ideal as it did not remove the relationship with
body size x. The reason for that is that as x
increases, the gap between the sexes becomes wider and therefore, so
does their respective distance from the global regression. The allometry
has not been corrected away because the trend with x is
different between the sexes.
Transforming the data so that the relationship looks similar between the sexes therefore sounds like a good solution to that problem. Since the relationship is allometric, what does log-transforming do, for example?
# Plot on a log-scale
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp)
# Note: we have to provide the `inverse` of the re-scaling function.
Now the relationship with x seems to be approximately
identical in both sexes, which is what we want, because it means that if
we now correct for x, then the relationship with
x should be gone and only a male-female offset remain.
# Correct the transformed data
plot_data(data_nl, midline = TRUE, linear = FALSE, correct = TRUE, transform = log, inverse = exp)
Tadaaa! Actually, we could do even better, as we see that the relationship now looks linear but our re-scaled prediction does not seem to capture exactly that relationship (it is curved). What if we re-fit a regression after transforming the data?
# Re-fit after transforming
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE)
The regressions seem to fit better. What happens when we correct for
x?
# Correct
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE, correct = TRUE)
We can also choose to fit a linear regression instead of an allometric one the second time around, since the data look quite linear. This would give us:
# Correct
plot_data(data_nl, midline = TRUE, linear = FALSE, transform = log, inverse = exp, refit = TRUE, linear_refit = TRUE, correct = TRUE)
# Note: this was done with `linear_refit`.
Pretty god damn perfect.
Note that here we were lucky in the sense that the log-transformation
we picked did the job in making the two sex-specific lines parallel to
each other, such that when we corrected x away we were only
left with an offset. But in practice, we may need to tinker around more
to find the right transformation. For example, the function
log(x) / (1 + log(x)) has a stronger effect than
log(x) in reducing high values, in case the latter did not
do the job. Also note that this transformation need not make the
relationship linear: what matters is that the trend is the same in the
two sexes, so that we can measure an offset that is more-or-less the
same along the x axis.
This brings us to the second problem, that of unequal spread of the data along the horizontal axis. This is a common feature in biological data sets, especially when it comes to sex differences. For example, males may have wider heads than females, but they may also be bigger than females. So, how much the dimorphism in head shape can be explained simply by an allometric relationship with body size? Consider the following data set:
# Simulate heterogeneity along the horizontal axis
data_hg <- simulate_data(intercept_f = 0.1, intercept_m = 10, slope_f = 0.2, slope_m = 0.2, sd_y = 1, homogeneity = 0.4)
# Plot heterogeneous data
plot_data(data_hg)
# Note: `homogeneity` shrinks one sex to the left and the other to the right
# when smaller than one
For the sake of the example we have simulated linear relationships with identical slopes in both sexes. What happens if we fit a global regression through both sexes?
# Fit a global regression
plot_data(data_hg, midline = TRUE)
As you may have expected, the global regression passes through both
sexes, but is very much not parallel to the sex-specific lines, which
represent the true scaling relationship between x and
y. Because of heterogeneity of the groups along the
horizontal axis, the effect of x that the global regression
aims to capture is grossly over-estimated.
This means that when we correct for this āwrongā relationship with
x, we get:
# Plot the corrected data
plot_data(data_hg, midline = TRUE, correct = TRUE)
Which does not remove the scaling with x at all.
And so, when both sexes covered the same range on the horizontal
axis, a global regression managed to capture the ātrueā relationship
between x and y. That is no longer the case
when both sexes do not overlap along x.
To solve that problem, we must use something else than a global regression to estimate this ātrueā relationship. We clearly see this relationship when we look at each sex separately, so what if we found a middle ground between these two?
# With middle ground
plot_data(data_hg, midline = TRUE, separate = TRUE)
This looks much better, and if we now correct for x:
# Correct
plot_data(data_hg, midline = TRUE, separate = TRUE, correct = TRUE)
The dependency on x is gone and what is left is the
sexual dimorphism between males and females!
Let us now combine both problems into one, very problematic data set
where not only are the (non-linear) relationships of y with
x different in both sexes but in addition, both sexes are
unequally distributed along the horizontal axis.
# Simulate problematic data
data_pb <- simulate_data(intercept_f = 0.1, intercept_m = 0.5, slope_f = 0.5, slope_m = 2, fun = function(x) exp(x / 30), homogeneity = 0.7)
# Plot
plot_data(data_pb, linear = FALSE)
What is the sex difference that is not attributable to differences in
body size in this data set where the sexes differ not only in body size
but also in how y scales with body size?
Well, we can combine the techniques we have seen above.
First, we have heterogeneous data, so we know that we should use a mid-line between the two separate regressions rather than a global regression.
# Plot with mid-line
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE)
But the relationship is different in males and females so the correction will not workā¦
# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, correct = TRUE)
This suggests we must transform the scale of the data. What about a log-transformation?
# Log-transform
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = log, inverse = exp)
This is better, but not great, as the relationship with
x is still present in the corrected data:
# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = log, inverse = exp, correct = TRUE)
A transformation by the function log(x) / (1 + log(x))
seems more appropriate.
# New transformation
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)))
# Note: exp(y / (1 - y)) is the inverse of log(x) / (1 + log(x)).
However, the corrected data still does not look optimal:
# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), correct = TRUE)
But that is because we have not re-fitted the model after transforming the data. What if we do that?
# Re-fit after transforming
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE)
## Warning in new_fun(transform(y)): NaNs produced
## Warning in new_fun(transform(y)): NaNs produced
This looks much better. And again, the relationship looks linear on this new scale so we might as well include that when we re-fit the model.
# Linear re--fitting
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE, linear_refit = TRUE)
If we now correct for x:
# Correct
plot_data(data_pb, linear = FALSE, midline = TRUE, separate = TRUE, transform = function(x) log(x) / (1 + log(x)), inverse = function(y) exp(y / (1 - y)), refit = TRUE, linear_refit = TRUE, correct = TRUE)
Tadaaa!
More than two groups? Make this a package? Or a paper?
This report was generated with the following specifications:
# Show session information
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Amsterdam
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.4
## [5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [9] ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 jsonlite_1.8.4 highr_0.10 compiler_4.3.3
## [5] tidyselect_1.2.1 jquerylib_0.1.4 scales_1.3.0 yaml_2.3.7
## [9] fastmap_1.1.1 R6_2.5.1 labeling_0.4.3 generics_0.1.3
## [13] knitr_1.42 munsell_0.5.1 bslib_0.4.2 pillar_1.9.0
## [17] tzdb_0.3.0 rlang_1.1.3 utf8_1.2.4 stringi_1.7.12
## [21] cachem_1.0.7 xfun_0.39 sass_0.4.5 timechange_0.2.0
## [25] cli_3.6.2 withr_3.0.0 magrittr_2.0.3 digest_0.6.36
## [29] grid_4.3.3 rstudioapi_0.14 hms_1.1.3 lifecycle_1.0.4
## [33] vctrs_0.6.5 evaluate_0.20 glue_1.7.0 farver_2.1.2
## [37] fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.21 tools_4.3.3
## [41] pkgconfig_2.0.3 htmltools_0.5.5